Precision and recall

In pattern recognition and information retrieval, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. When a program for recognizing the dogs in a scene correctly identifies four of the nine dogs but mistakes three cats for dogs, its precision is 4/7 while its recall is 4/9. When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.

In statistics, if the null hypothesis is that all and only the relevant items are retrieved, absence of type I and type II errors corresponds respectively to maximum precision (no false positives) and maximum recall (no false negatives). The above pattern recognition example contained 7 − 4 = 3 type I errors and 9 − 4 = 5 type II errors. Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity.

In even simpler terms, high recall means that an algorithm returned most of the relevant results. High precision means that an algorithm returned more relevant results than irrelevant.

Contents

Introduction

As an example, in an information retrieval scenario, the instances are documents and the task is to return a set of relevant documents given a search term; or equivalently, to assign each document to one of two categories, "relevant" and "not relevant". In this case, the "relevant" documents are simply those that belong to the "relevant" category. Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).

In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved) whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved).

In a classification task, a precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly) whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C).

Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. For example, an information retrieval system (such as a search engine) can often increase its recall by retrieving more documents, at the cost of increasing number of irrelevant documents retrieved (decreasing precision). Similarly, a classification system for deciding whether or not, say, a fruit is an orange, can achieve high precision by only classifying fruits with the exact right shape and color as oranges, but at the cost of low recall due to the number of false negatives from oranges that did not quite match the specification.

Usually, precision and recall scores are not discussed in isolation. Instead, either values for one measure are compared for a fixed level at the other measure (e.g. precision at a recall level of 0.75) or both are combined into a single measure, such as the F-measure, which is the weighted harmonic mean of precision and recall (see below), or the Matthews correlation coefficient.

Definition (information retrieval context)

In information retrieval contexts, precision and recall are defined in terms of a set of retrieved documents (e.g. the list of documents produced by a web search engine for a query) and a set of relevant documents (e.g. the list of all documents on the internet that are relevant for a certain topic), cf. relevance.

Precision

In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the search:

 \text{precision}=\frac{|\{\text{relevant documents}\}\cap\{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}

Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.

For example for a text search on a set of documents precision is the number of correct results divided by the number of all returned results.

Precision is also used with recall, the percent of all relevant documents that is returned by the search. The two measures are sometimes used together in the F1 Score (or f-measure) to provide a single measurement for a system.

Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology.

Recall

Recall in information retrieval is the fraction of the documents that are relevant to the query that are successfully retrieved.

 \text{recall}=\frac{|\{\text{relevant documents}\}\cap\{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}

For example for text search on a set of documents recall is the number of correct results divided by the number of results that should have been returned

In binary classification, recall is called sensitivity. So it can be looked at as the probability that a relevant document is retrieved by the query.

It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough but one needs to measure the number of non-relevant documents also, for example by computing the precision.

Definition (classification context)

For classification tasks, the terms true positives, true negatives, false positives, and false negatives (see also Type I and type II errors) compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction (sometimes known as the observation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the expectation). This is illustrated by the table below:

actual class
(expectation)
predicted class
(observation)
tp
(true positive)
Correct result
fp
(false positive)
Unexpected result
fn
(false negative)
Missing result
tn
(true negative)
Correct absence of result

Precision and recall are then defined as:[1]

\text{Precision}=\frac{tp}{tp%2Bfp} \,
\text{Recall}=\frac{tp}{tp%2Bfn} \,

Recall in this context is also referred to as the True Positive Rate, other related measures used in classification include True Negative Rate and Accuracy:.[1] True Negative Rate is also called Specificity.

\text{True negative rate}=\frac{tn}{tn%2Bfp} \,
\text{Accuracy}=\frac{tp%2Btn}{tp%2Btn%2Bfp%2Bfn} \,

Probabilistic interpretation

It is possible to interpret precision and recall not as ratios but as probabilities:

Note that the random selection refers to a uniform distribution over the appropriate pool of documents; i.e. by randomly selected retrieved document, we mean selecting a document from the set of retrieved documents in a random fashion. The random selection should be such that all documents in the set are equally likely to be selected.

Note that, in a typical classification system, the probability that a retrieved document is relevant depends on the document. The above interpretation extends to that scenario also (needs explanation).

Another interpretation for precision and recall is as follows. Precision is the average probability of relevant retrieval. Recall is the average probability of complete retrieval. Here we average over multiple retrieval queries.

F-measure

A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:

F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{ \mathrm{precision} %2B \mathrm{recall}}

This is also known as the F_1 measure, because recall and precision are evenly weighted.

It is a special case of the general F_\beta measure (for non-negative real values of \beta):

F_\beta = (1 %2B \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall} }{ \beta^2 \cdot \mathrm{precision} %2B \mathrm{recall}}

Two other commonly used F measures are the F_2 measure, which weights recall higher than precision, and the F_{0.5} measure, which puts more emphasis on precision than recall.

The F-measure was derived by van Rijsbergen (1979) so that F_\beta "measures the effectiveness of retrieval with respect to a user who attaches \beta times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure E = 1 - \frac{1}{\frac{\alpha}{P} %2B \frac{1-\alpha}{R}}. Their relationship is F_\beta = 1 - E where \alpha=\frac{1}{1 %2B \beta^2}.

Limitations as goals

There are other parameters and strategies for performance metric of information retrieval system. In particular, for web document retrieval, if the user's objectives are not clear, the precision and recall can't be optimized. As summarized by D. Lopresti,

"Browsing is a comfortable and powerful paradigm (the serendipity effect).
  • Search results don't have to be very good.
  • Recall? Not important (as long as you get at least some good hits).
  • Precision? Not important (as long as at least some of the hits on the first page you return are good).
"[2]

See also

Sources

  1. ^ a b Olson, David L.; Delen, Dursun ”Advanced Data Mining Techniques” Springer; 1 edition (February 1, 2008), page 138, ISBN 3540769161
  2. ^ Daniel Lopresti 2001, WDA 2001 panel

External links